If optimization flags feel like magic, inspect the assembly. Every speedup claim here ties back to a change in code shape that you can see in the generated instructions.
Baseline kernel used in the examples
void saxpy(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
We keep the same loop and only change flags. That way, assembly diffs show exactly what each option buys you.
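Before comparing assembly across flags, it helps to pin down that every build still computes the same thing. The sketch below is one way to do that; saxpy_mismatches is a hypothetical helper, not part of the original kernel. It feeds small integer-valued inputs so each result is exactly representable, which keeps the comparison valid whether or not the compiler fuses the multiply and add.

```c
#include <assert.h>

/* The kernel from the text, unchanged. */
void saxpy(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical helper: runs saxpy with x[i] = i, y[i] = 1, a = 2 and
   returns how many elements differ from the exact answer 2*i + 1.
   All values stay below 2^24, so float arithmetic is exact here. */
int saxpy_mismatches(int n) {
    enum { MAX_N = 1024 };
    static float x[MAX_N];
    static float y[MAX_N];
    assert(n >= 0 && n <= MAX_N);

    for (int i = 0; i < n; ++i) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }
    saxpy(y, x, 2.0f, n);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        if (y[i] != 2.0f * (float)i + 1.0f) {
            ++bad;
        }
    }
    return bad;
}
```

Compiling this same file under each flag set and checking that saxpy_mismatches returns 0 gives a cheap sanity check before any timing or assembly diffing.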
-O0 (debug shape)
Why it helps: keeps source-to-assembly mapping simple for stepping and breakpoints.
Assembly translation: locals live on the stack, so values are stored and reloaded between statements instead of staying in registers.
stp x29, x30, [sp, #-48]!
mov x29, sp
...
ldr s0, [x1, x3, lsl #2]
str s0, [sp, #28] // spill
ldr s0, [sp, #28] // reload
bl helper // calls are not inlined at -O0
When it backfires: numbers from this build do not represent production speed.
-O2 (balanced release baseline)
Why it helps: enables strong scalar optimizations without extreme code growth.
Assembly translation: fewer spills, tighter loop body, better scheduling.
mov x3, #0
.Lloop:
ldr s0, [x1, x3, lsl #2]
ldr s1, [x0, x3, lsl #2]
fmadd s1, s0, s2, s1
str s1, [x0, x3, lsl #2]
add x3, x3, #1
cmp x3, x2
b.ne .Lloop
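When it backfires: latent undefined behavior (signed overflow, strict-aliasing violations, uninitialized reads) can stay invisible at -O0 and only misbehave in optimized builds.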
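A common case of UB that only bites under optimization is a signed-overflow check written in terms of the overflow itself. The sketch below uses hypothetical function names. The first version relies on wraparound, which the C standard leaves undefined for signed int, so an optimizer is allowed to assume it never happens. The second version asks the same question without overflowing.

```c
#include <assert.h>
#include <limits.h>

/* Relies on signed overflow, which is undefined behavior. An optimizing
   build may assume x + 1 never wraps and fold this to "always 0". */
int next_would_wrap_ub(int x) {
    return x + 1 < x;
}

/* Well-defined replacement: compare against the limit before adding. */
int next_would_wrap(int x) {
    return x == INT_MAX;
}

/* Safe increment built on the well-defined check; saturates at INT_MAX. */
int saturating_next(int x) {
    int wraps = next_would_wrap(x);
    assert(wraps == 0 || x == INT_MAX);
    return wraps ? INT_MAX : x + 1;
}
```

Comparing the assembly of next_would_wrap_ub at -O0 and -O2 is a quick way to see the difference: the optimized build is free to drop the add and the compare entirely.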
-O3 (more aggressive loop transforms)
Why it helps: pushes harder on vectorization/unrolling for compute-heavy hot loops.
Assembly translation: often emits NEON loop body plus scalar tail.
dup v2.4s, v0.s[0] // broadcast a (passed in s0)
.Lvec:
ld1 {v0.4s}, [x1], #16
ld1 {v1.4s}, [x0] // no post-increment: the store reuses this address
fmla v1.4s, v0.4s, v2.4s
st1 {v1.4s}, [x0], #16
subs x2, x2, #4
b.gt .Lvec
When it backfires: bigger code can hurt I-cache and reduce gains on branchy code.
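The "vector body plus scalar tail" shape can be written out by hand in plain C to make the structure concrete. This is only a picture of the loop shape, not what the compiler literally produces; saxpy_blocked is a hypothetical name. Because y and x are not marked restrict, a real vectorizer also has to prove or check at run time that the two arrays do not overlap before it takes the vector path.

```c
#include <assert.h>

/* Hand-written picture of the -O3 loop shape: a 4-wide main body
   followed by a scalar tail for the leftover elements. */
void saxpy_blocked(float* y, const float* x, float a, int n) {
    assert(n >= 0);
    int i = 0;

    /* Main body: four elements per trip, like one ld1/fmla/st1 group.
       The condition is written as n - i >= 4 so it cannot overflow. */
    for (; n - i >= 4; i += 4) {
        y[i + 0] = a * x[i + 0] + y[i + 0];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }

    /* Scalar tail: the 0 to 3 elements that do not fill a full vector. */
    for (; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

With n = 10 the main body runs twice (elements 0 to 7) and the tail handles elements 8 and 9.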
-Ofast / -ffast-math
Why it helps: allows algebraic rewrites that strict IEEE mode blocks.
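Assembly translation: more reassociation, which mainly shows up as vectorized reductions and reordered arithmetic. Fusion is a separate switch: -ffp-contract controls it, and GCC already defaults to -ffp-contract=fast outside strict ISO C mode, which is why the -O2 loop above uses fmadd.
// contraction off (-ffp-contract=off)
fmul s0, s1, s2
fadd s0, s0, s3
// contraction allowed
fmadd s0, s1, s2, s3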
When it backfires: NaN/infinity/rounding behavior can change. Use only if numerics allow it.
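To see why reassociation is not a free rewrite, it is enough to add three numbers in two different groupings. The sketch below uses hypothetical function names and assumes a strict-math build (no -ffast-math). The inputs are chosen so that 1.0 is far smaller than the spacing between doubles near 1e20, which makes one grouping lose it completely.

```c
#include <assert.h>

/* Left-to-right order, which strict IEEE evaluation must preserve. */
double sum_left_to_right(double a, double b, double c) {
    return (a + b) + c;
}

/* A regrouping that a fast-math build is permitted to choose instead. */
double sum_regrouped(double a, double b, double c) {
    return a + (b + c);
}

/* Under strict math the two groupings disagree for these inputs:
   (1e20 + -1e20) + 1.0 is 1.0, but -1e20 + 1.0 rounds back to -1e20,
   so 1e20 + (-1e20 + 1.0) is 0.0. */
void demo_grouping(void) {
    assert(sum_left_to_right(1e20, -1e20, 1.0) == 1.0);
    assert(sum_regrouped(1e20, -1e20, 1.0) == 0.0);
}
```

With -ffast-math the compiler is allowed to treat both functions as equivalent, so neither assertion is guaranteed to hold in such a build. That is the numerical risk in its smallest form.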
-flto (link-time optimization)
Why it helps: compiler sees across translation units and can inline across file boundaries.
Assembly translation: direct call can disappear and turn into straight-line ops.
// no LTO
bl _Z12update_blockPfPKf
// with LTO
ld1 {v0.4s}, [x1]
ld1 {v1.4s}, [x0]
fmla v1.4s, v0.4s, v2.4s
st1 {v1.4s}, [x0]
When it backfires: longer link times and higher link-time memory use on huge codebases, and cross-module inlining can grow the binary.
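The call boundary that LTO removes can be imitated in a single file, which makes it easier to diff the two shapes. In the sketch below every name is hypothetical, and the noinline attribute (a GCC/Clang extension) stands in for "defined in another translation unit without LTO". It is an approximation of the effect, not a use of -flto itself.

```c
#include <assert.h>

enum { BLOCK = 4 };

/* Stand-in for a function in another .c file: noinline keeps the call
   boundary that a non-LTO build would have. */
__attribute__((noinline))
void update_block_opaque(float* y, const float* x, float a) {
    for (int i = 0; i < BLOCK; ++i) {
        y[i] += a * x[i];
    }
}

/* Same body, visible to the caller and inlinable: roughly the view
   that LTO gives the optimizer across file boundaries. */
static inline void update_block_visible(float* y, const float* x, float a) {
    for (int i = 0; i < BLOCK; ++i) {
        y[i] += a * x[i];
    }
}

/* Caller that keeps a real call per block. */
void run_opaque(float* y, const float* x, float a, int blocks) {
    assert(blocks >= 0);
    for (int b = 0; b < blocks; ++b) {
        update_block_opaque(y + BLOCK * b, x + BLOCK * b, a);
    }
}

/* Caller where the optimizer may inline the body and drop the call. */
void run_visible(float* y, const float* x, float a, int blocks) {
    assert(blocks >= 0);
    for (int b = 0; b < blocks; ++b) {
        update_block_visible(y + BLOCK * b, x + BLOCK * b, a);
    }
}
```

At -O2, run_opaque should still contain a bl to the helper, while run_visible is a candidate for the straight-line form shown above. Both produce the same results.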
-fprofile-generate / -fprofile-use (PGO)
Why it helps: measured hot-path data drives block ordering and a branch-prediction-friendly layout.
Assembly translation: hot path tends to become fall-through; cold blocks move away.
cbz w0, .Lcold
// hot path falls through
...
b .Ldone
.Lcold:
...
When it backfires: stale or unrepresentative training runs can produce slower code.
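The hot/cold layout can be approximated without a training run by giving the compiler a static hint. The sketch below uses __builtin_expect, a GCC/Clang builtin, behind a hypothetical UNLIKELY macro. This is a hand-written stand-in for profile data, not PGO itself: the hint is only as good as the programmer's guess, whereas -fprofile-use derives it from measured counts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro: static hint that cond is rarely true. */
#define UNLIKELY(cond) __builtin_expect(!!(cond), 0)

/* Sums a buffer, treating a negative entry as a rare error case.
   On error, sets *error to 1 and returns -1. */
long sum_nonnegative(const int* v, size_t n, int* error) {
    assert(error != NULL);
    assert(v != NULL || n == 0);

    long total = 0;
    *error = 0;
    for (size_t i = 0; i < n; ++i) {
        if (UNLIKELY(v[i] < 0)) {
            /* Cold block: hinted out of the fall-through path. */
            *error = 1;
            return -1;
        }
        /* Hot block: expected to stay on the fall-through path. */
        total += v[i];
    }
    return total;
}
```

Inspecting the output at -O2 with and without the hint is a low-cost way to get a feel for the block placement that a real profile would drive.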
-march / -mcpu / -mtune (target selection)
What each flag controls
-march: ISA features allowed (for example SVE availability).
-mcpu: target core plus tuning defaults; most practical single switch for fixed hardware.
-mtune: scheduling model without changing ISA baseline.
Assembly translation
If SVE is enabled in -march, GCC can emit predicate-driven vector loops:
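mov x3, #0
whilelt p0.s, x3, x2 // predicate for the first iteration
.Lsveloop:
ld1w z0.s, p0/z, [x1, x3, lsl #2]
ld1w z1.s, p0/z, [x0, x3, lsl #2]
fmla z1.s, p0/m, z0.s, z2.s
st1w z1.s, p0, [x0, x3, lsl #2]
incw x3
whilelt p0.s, x3, x2 // next predicate, sets the flags b.any tests
b.any .Lsveloop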
Avoid -march=native for distributable binaries; it ties the output to the build host's CPU features, so the binary can fault with an illegal instruction on cores that lack them.
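A quick way to confirm which vector ISA a given flag set actually enabled is to ask the preprocessor. The sketch below checks the ACLE feature macros __ARM_FEATURE_SVE and __ARM_NEON; the function names are hypothetical. It reports what the compiler was allowed to emit, which is a compile-time fact and says nothing about the CPU the binary later runs on.

```c
#include <assert.h>
#include <stdio.h>

/* Names the widest vector ISA enabled by the current compile flags,
   based on ACLE predefined macros. Returns a static string. */
const char* enabled_vector_isa(void) {
#if defined(__ARM_FEATURE_SVE)
    return "SVE";
#elif defined(__ARM_NEON)
    return "NEON";
#else
    return "none (or not an Arm target)";
#endif
}

/* Writes a one-line summary to out and returns fprintf's result
   (the number of characters written, or a negative value on error). */
int print_target_summary(FILE* out) {
    assert(out != NULL);
    return fprintf(out, "vector ISA enabled at compile time: %s\n",
                   enabled_vector_isa());
}
```

Building the same file once with a baseline -march and once with SVE added to it should flip the reported string, which is a cheap check that the flag reached the compiler.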
-fno-omit-frame-pointer
Why it helps: stable unwind chains for profilers and postmortem stacks.
stp x29, x30, [sp, #-32]!
mov x29, sp
...
ldp x29, x30, [sp], #32
ret
What you trade away
Assembly translation: one extra reserved register (x29) and extra prologue/epilogue work.
When it backfires: tiny runtime cost in very hot functions, but often worth it in production diagnostics.